Normalization from Assignment 1

Downloading required packages

(Zhu et al. 2008) (Xie 2023) (Robinson, McCarthy, and Smyth 2010) (Morgan 2022) (Durinck et al. 2009) (Gu et al. 2014) (Gu, Eils, and Schlesner 2016) (Wickham 2016) (Slowikowski 2023) (Lieberman et al. 2020) (Raudvere et al. 2019)

if (!requireNamespace("GEOmetadb", quietly = TRUE))
  BiocManager::install("GEOmetadb")

if (!requireNamespace("knitr", quietly = TRUE))
  install.packages("knitr")

if (!require("edgeR", quietly = TRUE))
  BiocManager::install("edgeR")

if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")

if (!requireNamespace("biomaRt", quietly = TRUE))
  BiocManager::install("biomaRt")

if (!requireNamespace("ComplexHeatmap", quietly = TRUE))
  BiocManager::install("ComplexHeatmap")

if (!requireNamespace("circlize", quietly = TRUE))
  BiocManager::install("circlize")

if (!requireNamespace("ggplot2", quietly = TRUE))
  install.packages("ggplot2")

if (!requireNamespace("ggrepel", quietly = TRUE))
  install.packages("ggrepel")

Calling required packages

library(BiocManager)
library(GEOmetadb)
library(knitr)
library(edgeR)
library(biomaRt)
library(ComplexHeatmap)
library(circlize)
library(ggplot2)
library(ggrepel)

Important Note

On 2023/03/07 asked in lecture if I could randomly sample from data to reduce computation load. I was recommended that I should try and use the whole data. Tried to use whole data set but it was too large and would crash my R, and R studio. On 2023/03/14 posted in discussion board if I could random sample to reduce the load on the computer and reduce runtime errors. Did not get a reply by 2023/03/14 midnight so will be going with random sampling. After reducing the load did not get any crashes or errors I have gotten before with a large file.

Download the data

GSE152074 raw data supplemntary file downloaded

sfiles = getGEOSuppFiles('GSE152075')
fnames = rownames(sfiles)
# there is only one supplemental file
readData = read.table(fnames[1],header=TRUE, check.names = TRUE)

Assess

Add 1 to all values of data so later on when conducting log2(cpm) we can avoid negative infinity values. (Advised by Professor Isserlin)

readData <- readData + 1

Setting first column as gene id for future format purposes

#Place rownames in first column for future format purposes
inter <- data.frame("HUGO" = rownames(readData))
geneData <- cbind(inter$HUGO, readData)
colnames(geneData)[1] <- "HUGO"

Clean

Remove any outliers that does not have at least 2 read per million in n of the samples. We set this as 2 since we add 1 to all of our dataset in the beginning of the code to have better plots. Denoting n as the smallest group of replicates which is the control group of 53. Using n = 53 conduct the removal of low counts.

#translate out counts into counts per millison using 
#the edgeR package function cpm
cpms = cpm(geneData[,2:485])
rownames(cpms) <- geneData[,1]
# get rid of low counts
keep = rowSums(cpms >2) >=53
geneData_exp_filtered = geneData[keep,]

Remove version numbers if they exists on gene id(HUGO) column. This makes it easier for mapping later on.

geneData_exp_filtered[,1] <- gsub("\\.[0-9]", "", geneData_exp_filtered[,1])

Map

#Mapping the name using biomatr
# list available gene annotation databases
bio <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
conversion_stash <- "geneMapping.rds"
if(file.exists(conversion_stash)){
  geneMapping <- readRDS(conversion_stash)
} else{
# convert column of gene IDs to Hugo symbols
geneMapping <- getBM(attributes = c("ensembl_gene_id", "hgnc_symbol"),
                     mart = bio,
                     filters = "hgnc_symbol",
                     values = geneData_exp_filtered[,1])
saveRDS(geneMapping, conversion_stash)
}

Combine the mapped gene data to original data

#Merge the data
mergedData <- merge(geneData_exp_filtered, geneMapping, by.x = 1, by.y = 2)
#remove duplicate rows in the gene data
mergedDataNoDup <- mergedData[!duplicated(mergedData[,1:485]),]

Apply Normalization

Randomly sample data to reduce the size of sample. Original sample is too large leading to computation errors due to the limitation of author’s computer.

set.seed(12345)
randomSamplePOS <- sample(mergedDataNoDup[2:431], 25)
randomSampleNEG <- sample(mergedDataNoDup[432:485], 25)
randomSample <- cbind(randomSamplePOS,randomSampleNEG, mergedDataNoDup$ensembl_gene_id, mergedDataNoDup$HUGO)

Define groups to use in normalization

samples <- data.frame(lapply(colnames(randomSample[1:50]), 
        FUN=function(x){unlist(strsplit(x, 
                        split = "_"))[c(2,1)]}))
colnames(samples) <- colnames(randomSample[1:50])
rownames(samples) <- c("patients","cell_type")
samples <- data.frame(t(samples))

Applying TMM to data

filtered_data_matrix <- as.matrix(randomSample[1:50])
rownames(filtered_data_matrix) <- randomSample$`mergedDataNoDup$ensembl_gene_id`
d = DGEList(counts=filtered_data_matrix, group=samples$cell_type)

d = calcNormFactors(d)

normalized_counts <- cpm(d)
#add columns of ensembl and hgnc id

normalized_count_data = data.frame(normalized_counts)
normalized_count_data$ensembl_gene_id <- mergedDataNoDup$ensembl_gene_id
normalized_count_data$hgnc_symbol <- mergedDataNoDup$HUGO


#This is a duplicate ensembl id that is giving errors when running code.
normalized_count_data <- normalized_count_data[-c(1902),]
plotMDS(d, labels=rownames(samples),
 col = c("darkgreen","blue")[factor(samples$cell_type)])

Figure A: MDS plot from Assignment 1 # Dispersion

model_design <- model.matrix(~samples$cell_type+0)
d <- estimateDisp(d, model_design)

Graphing the BCV

plotBCV(d,col.tagwise = "black",col.common = "red",)

Figure B: BCV plot showing the biological coefficient of variation of samples

plotMeanVar(d, show.raw.vars = TRUE,
 show.tagwise.vars=TRUE, NBline=TRUE, 
 show.ave.raw.vars = TRUE,show.binned.common.disp.vars = TRUE)

Figure C: Mean variance plot of samples # Differential Gene Expression

Answer the following questions. 1. Calculate p-values for each of the genes in your expression set. How many genes were significantly differential expressed? What thresholds did you use and why? 2. Multiple hypothesis testing - correct your p-values using a multiple hypothesis correction method. Which method did you use? And Why? How many genes passed correction? 3. Show the amount of differential expressed genes using an MA Plot or a Volcano plot. Highlight genes of interest. 4. Visualize your top hits using a heat map. Do you conditions cluster together? Explain why or why not.

1. 2. Calculate P-Value, and Multiple hypothesis testing

LIMMA

p-value calculation using LIMMA

model_design <- model.matrix(~ samples$cell_type )

Patient Variability

Not taking into account patient variability

expressionMatrix <- as.matrix(normalized_count_data[,1:50])
rownames(expressionMatrix) <- 
  normalized_count_data$ensembl_gene_id
colnames(expressionMatrix) <- 
  colnames(normalized_count_data)[1:50]
minimalSet <- ExpressionSet(assayData=expressionMatrix)

#Fit our data to model
fit <- lmFit(minimalSet, model_design)
fit2 <- eBayes(fit,trend=TRUE)
topfit <- topTable(fit2, 
                   coef=ncol(model_design),
                   adjust.method = "BH",
                   number = nrow(expressionMatrix))
#merge hgnc names to topfit table
output_hits <- merge(normalized_count_data[,51:52],
                     topfit,
                     by.y=0,by.x=1,
                     all.y=TRUE)
#sort by pvalue
output_hits <- output_hits[order(output_hits$P.Value),]
#number of genes that passed the p-value threshold
#low values not in our data but that could result means the variation in biological variation in patients is different
length(which(output_hits$P.Value < 0.05))
## [1] 2892
#How many genes passed correction
length(which(output_hits$adj.P.Val < 0.05))
## [1] 473

Taking into account Patient variability

model_design_pat <- model.matrix(
  ~ samples$patients + samples$cell_type)
fit_pat <- lmFit(minimalSet, model_design_pat)
fit2_pat <- eBayes(fit_pat,trend=TRUE)

topfit_pat <- topTable(fit2_pat, 
                   coef=ncol(model_design_pat),
                   adjust.method = "BH",
                   number = nrow(expressionMatrix))
#merge hgnc names to topfit table
output_hits_pat <- merge(normalized_count_data[,51:52],
                         topfit_pat,by.y=0,by.x=1,all.y=TRUE)
#sort by pvalue
output_hits_pat <- output_hits_pat[order(output_hits_pat$P.Value),]
length(which(output_hits_pat$P.Value < 0.05))
## [1] 1325
length(which(output_hits_pat$adj.P.Val < 0.05))
## [1] 0

QLF

d = DGEList(counts=filtered_data_matrix, group=samples$cell_type)
d <- estimateDisp(d, model_design_pat)
fit <- glmQLFit(d, model_design_pat)
qlf.pos_vs_neg <- glmQLFTest(fit, coef='samples$cell_typePOS')
kable(topTags(qlf.pos_vs_neg), type="html",row.names = FALSE)
logFC logCPM F PValue FDR
8.508844 3.630861 56.94465 0 0.00e+00
9.217790 3.372834 44.85154 0 2.00e-07
8.238509 3.548781 39.51653 0 1.60e-06
8.480522 3.438800 38.21139 0 2.40e-06
7.357389 3.446313 35.77654 0 6.60e-06
14.042290 8.460151 34.52607 0 9.50e-06
8.560409 2.932483 34.40267 0 9.50e-06
10.527753 7.518887 32.50111 0 2.20e-05
7.998298 3.187208 31.11275 0 3.99e-05
7.616290 3.501343 29.92408 0 6.62e-05
x
BH
x
samples$cell_typePOS
x
glm

P-values were corrected using Quasilikelihood method. Quasilikelihood is better becacuse it is tailored towards RNAseq data

qlf_output_hits <- topTags(qlf.pos_vs_neg,sort.by = "PValue",
                           n = nrow(normalized_count_data))
length(which(qlf_output_hits$table$PValue < 0.05))
## [1] 1569
length(which(qlf_output_hits$table$FDR < 0.05))
## [1] 276

Answer to question

LIMMA

1.Genes Significantly expressed(Not taking into account patient variability): 2892

Genes Significantly expressed(Taking into account patient variability): 1325

Threshold: 0.05; Used a globally accepted threshold value good for general purposed even in RNASEQ

  1. Genes passed correction (Not taking into account patient variability): 473

Genes passed correction (Taking into account patient variability): 0

QLF 1.Genes Significantly expressed: 1569

Threshold: 0.05; Used a globally accepted threshold value good for general purposed even in RNASEQ

2.Genes passed correction: 276

3. Volcano plot

Code copied from (Bonnin 2020)

volcanoData <- qlf_output_hits$table
volcanoData$geneName <- normalized_count_data$hgnc_symbol
volcanoData$regulated <- "Not significant"
volcanoData$regulated[volcanoData$PValue < 0.05 & volcanoData$logFC > 0] <- "Up regulated"
volcanoData$regulated[volcanoData$PValue < 0.05 & volcanoData$logFC < 0] <- "Down regulated"

volcanoData$vlabel <- NA

volcanoData$vlabel[volcanoData$regulated != "Not significant"] <- volcanoData$geneName[volcanoData$regulated != "Not significant"]


ggplot(data = volcanoData, aes(x = logFC, y = -log10(FDR), col = regulated, label = vlabel)) +
  geom_point() +
  theme_minimal() +
  scale_color_manual(values=c("blue", "black", "red")) +
  geom_vline(xintercept=c(-0.6, 0.6), col="red") +
  geom_hline(yintercept=-log10(0.05), col="red") +
  geom_text_repel()

Figure D: Volcano plot showing differential expressed genes highlighting genes that are statisically significant

4. Heatmap

heatmap_matrix <- normalized_count_data[,1:50]
rownames(heatmap_matrix) <- normalized_count_data$ensembl_gene_id

Heatmap before differential expression analysis

heatmap_matrix <- t(scale(t(heatmap_matrix)))
if(min(heatmap_matrix) == 0){
    heatmap_col = colorRamp2(c( 0, max(heatmap_matrix)), 
                      c( "white", "red"))
  } else {
    heatmap_col = colorRamp2(c(min(heatmap_matrix), 0,
        max(heatmap_matrix)), c("blue", "white", "red"))
  }

current_heatmap <- Heatmap(as.matrix(heatmap_matrix),
      show_row_dend = TRUE,show_column_dend = TRUE, 
      col=heatmap_col,show_column_names = TRUE, 
      show_row_names = FALSE,show_heatmap_legend = TRUE)
current_heatmap

Figure E: Heatmap of genes before conducting any differential analysis. On a scale from Blue to Red going negative to positive. ### LIMMA Heatmap

top_hits <- output_hits_pat$ensembl_gene_id[
  output_hits_pat$P.Value<0.05]
heatmap_matrix_tophits <- t(
  scale(t(heatmap_matrix[
    which(rownames(heatmap_matrix) %in% top_hits),])))
if(min(heatmap_matrix_tophits) == 0){
    heatmap_col = colorRamp2(c( 0, max(heatmap_matrix_tophits)), 
                             c( "white", "red"))
  } else {
    heatmap_col = colorRamp2(c(min(heatmap_matrix_tophits), 0,
      max(heatmap_matrix_tophits)), c("blue", "white", "red"))
  }
limma_heatmap <- Heatmap(as.matrix(heatmap_matrix_tophits),
                           cluster_rows = TRUE,
                           cluster_columns = TRUE,
                               show_row_dend = TRUE,
                               show_column_dend = TRUE, 
                               col=heatmap_col,
                               show_column_names = TRUE, 
                               show_row_names = FALSE,
                               show_heatmap_legend = TRUE,
                               )
limma_heatmap

Figure F: Heatmap of genes conducting LIMMA patient model differential gene expression analysis. On a scale from Blue to Red going negative to positive.

QLF Heatmap

top_hits <- rownames(qlf_output_hits$table)[
  output_hits_pat$P.Value<0.05]
heatmap_matrix_tophits <- t(
  scale(t(heatmap_matrix[which(rownames(heatmap_matrix) 
      %in% top_hits),])))
if(min(heatmap_matrix_tophits) == 0){
    heatmap_col = colorRamp2(c( 0, max(heatmap_matrix_tophits)), 
                             c( "white", "red"))
  } else {
    heatmap_col = colorRamp2(c(min(heatmap_matrix_tophits), 0, 
                      max(heatmap_matrix_tophits)), 
                      c("blue", "white", "red"))
  }
qlf_heatmap <- Heatmap(as.matrix(heatmap_matrix_tophits),
                           cluster_rows = TRUE,
                           cluster_columns = TRUE,
                               show_row_dend = TRUE,
                               show_column_dend = TRUE, 
                               col=heatmap_col,
                               show_column_names = TRUE, 
                               show_row_names = FALSE,
                               show_heatmap_legend = TRUE,
                               )
qlf_heatmap

Figure G: Heatmap of genes conducting QLF patient model differential gene expression analysis. On a scale from Blue to Red going negative to positive.

QLF, and LIMMA comparison

qlf_pat_model_pvalues <- data.frame(
          ensembl_id = rownames(qlf_output_hits$table),
          qlf_patient_pvalue=qlf_output_hits$table$PValue)
limma_pat_model_pvalues <-  data.frame(
          ensembl_id = output_hits_pat$ensembl_gene_id,
          limma_patient_pvalue = output_hits_pat$P.Value)
two_models_pvalues <- merge(qlf_pat_model_pvalues,
                            limma_pat_model_pvalues,
                            by.x=1,by.y=1)
two_models_pvalues$colour <- "black"
two_models_pvalues$colour[two_models_pvalues$qlf_patient_pvalue
                          <0.05] <- "orange"
two_models_pvalues$colour[two_models_pvalues$limma_patient_pvalue
                          <0.05] <- "blue"
two_models_pvalues$colour[two_models_pvalues$qlf_patient_pvalue
                          <0.05 &                            
two_models_pvalues$limma_patient_pvalue<0.05] <- "red"
plot(two_models_pvalues$qlf_patient_pvalue,
     two_models_pvalues$limma_patient_pvalue,
     col = two_models_pvalues$colour,
     xlab = "QLF patient model p-values",
     ylab ="Limma Patient model p-values",
     main="QLF vs Limma")

Figure H: Visualizing p-value calculated from LIMMA and QLF. X-axis QLF patient model, Y-axis LIMMA patient model

Answer to question

  1. First the heat map using LIMMA we could see some patterns between the NEG, an POS. They seem to be clustering together a bit. Not all but most of the NEG results have a clustering of gene expression in the upper half of the genes.

Using QLF the NEG seem to be clustering together also.

Thresholded over-representation analysis

Write to file upregulated, and downregulated genes

Which ones are upregulated and downregulated

length(which(qlf_output_hits$table$PValue < 0.05 
             & qlf_output_hits$table$logFC > 0))
## [1] 1424
length(which(qlf_output_hits$table$PValue < 0.05 
             & qlf_output_hits$table$logFC < 0))
## [1] 145
qlf_output_hits_withgn <- merge(randomSample[,51:52],qlf_output_hits, by.x=1, by.y = 0)
#number higher the lower the pvalue, and if it is upregulated number is positive, and negative for downregulated
qlf_output_hits_withgn[,"rank"] <- -log(qlf_output_hits_withgn$PValue,base =10) * sign(qlf_output_hits_withgn$logFC)
qlf_output_hits_withgn <- qlf_output_hits_withgn[order(qlf_output_hits_withgn$rank),]
upregulated_genes <- qlf_output_hits_withgn$`mergedDataNoDup$HUGO`[
  which(qlf_output_hits_withgn$PValue < 0.05 
             & qlf_output_hits_withgn$logFC > 0)]
downregulated_genes <- qlf_output_hits_withgn$`mergedDataNoDup$HUGO`[
  which(qlf_output_hits_withgn$PValue < 0.05 
             & qlf_output_hits_withgn$logFC < 0)]
write.table(x=upregulated_genes,
            file=file.path("data","upregulated_genes.txt"),sep = "\t",
            row.names = FALSE,col.names = FALSE,quote = FALSE)
write.table(x=downregulated_genes,
            file=file.path("data","downregulated_genes.txt"),sep = "\t",
            row.names = FALSE,col.names = FALSE,quote = FALSE)
write.table(x=data.frame(genename= qlf_output_hits_withgn$`mergedDataNoDup$HUGO`,F_stat= qlf_output_hits_withgn$rank),
            file=file.path("data","ranked_genelist.txt"),sep = "\t",
            row.names = FALSE,col.names = FALSE,quote = FALSE)

Screenshot of g:profiler query

Upregulated GO:

Figure 1a: Upregulated Genes GO

Upregulated REAC:

Figure 1b: Upregulated Genes REAC

Upregulated WIKI:

Figure 1c: Upregulated Genes WIKI

Downregulated GO:

Figure 2a: Downregulated Genes GO

Downregulated REAC:

Figure 2b: Downregulated Genes GO

Downregulated WIKI:

Figure 2c: Downregulated Genes GO

ALL GO(Split into 8 queries due to large gene size):

Figure 3a: All Genes GO (1/8) Figure 3b: All Genes GO (2/8) Figure 3c: All Genes GO (3/8) Figure 3d: All Genes GO (4/8) Figure 3e: All Genes GO (5/8) Figure 3f: All Genes GO (6/8) Figure 3g: All Genes GO (7/8) Figure 3h: All Genes GO (8/8)

ALL REAC(Split into 8 queries due to large gene size):

Figure 3i: All Genes REAC (1/8) Figure 3j: All Genes REAC (2/8) Figure 3k: All Genes REAC (3/8) Figure 3l: All Genes REAC (4/8) Figure 3m: All Genes REAC (5/8) Figure 3n: All Genes REAC (6/8) Figure 3o: All Genes REAC (7/8) Figure 3p: All Genes REAC (8/8)

ALL WIKI(Split into 8 queries due to large gene size):

Figure 3q: All Genes WIKI (1/8) Figure 3r: All Genes WIKI (2/8) Figure 3s: All Genes WIKI (3/8) Figure 3t: All Genes WIKI (4/8) Figure 3u: All Genes WIKI (5/8) Figure 3v: All Genes WIKI (6/8) Figure 3w: All Genes WIKI (7/8) Figure 3x: All Genes WIKI (8/8)

  1. Which method did you choose and why?

Using the g:Profiler since this method is updated frequently. Threshold: 0.05. Significance threshold: Benjamini-Hochberg FDR Data SourceS: GO biological process, Reactome, Wikipathways

  1. What annotation data did you use and why? What version of the annotation are you using?

Using the following annotations.

Version: GRCh38.p13

GO:BP : annotations: BioMart, Release 2022-12-04

Reactome:Annotations: Biomart, 2022-12-28

Wikipathways: 2022-12-28

  1. How many genesets were returned with what thresholds?

Threshold: 0.05

Upregulated:

GO: 379, REAC: 14, WP: 35

Downregulated:

GO: 180, REAC: 61, WP: 11

ALL: Used the list from ranked_genelist.txt. Since gene set so large need to run 2000 genes at a time, or else the browser crashes. This poses a problem where the statistical power of the gene list changes as the contents are different for each 2000 set. However the most important set is the first 2000, and the last 2000. Where the pvalue is statistically significant.

GO:444 REAC:86 WP:12

GO:102 REAC:31 WP:0

GO:82 REAC:6 WP:0

GO:0 REAC:1 WP:0

GO:1 REAC:9 WP:0

GO:60 REAC:3 WP:3

GO: 439 REAC: 12 WP: 48

Total: GO:1128 REAC:136 WP:63

  1. Run the analysis using the up-regulated set of genes, and the down-regulated set of genes separately. How do these results compare to using the whole list (i.e all differentially expressed genes together vs. the up-regulated and down regulated differentially expressed genes separately)?

The first 2000 had similar values with downregulated gene.The upregulated genes gave different results for example interferon related pathways compared to common cytoplasmic translation that came up in downregulated and the whole set. The last 2000 I assumed it would be similar to upregulated however not many pathways seemed to be in common

Interpretation

  1. Do the over-representation results support conclusions or mechanism discussed in the original paper? Upregulation shows in our Gprofiler query the response to type II interferon. In our paper it mentions the “induction of interferon stimulated genes”. Also “expression of interferon-responsive genes” (Lieberman et al. 2020).

  2. Can you find evidence, i.e. publications, to support some of the results that you see. How does this evidence support your results.

I found an article that states the SARS-CoV-2 positive thyroid tissue of patients activated Type I and Type II interferon signaling. We can see in our gprofiler query of upregulated genes that there is a significant pathway of responding to type II interferon (Poma et al. 2021).

References

Bonnin, Sarah. 2020. “CRG r Introduction. Volcano Plots.” 2020. https://biocorecrg.github.io/CRG_RIntroduction/volcano-plots.html.
Durinck, Steffen, Paul T. Spellman, Ewan Birney, and Wolfgang Huber. 2009. “Mapping Identifiers for the Integration of Genomic Datasets with the r/Bioconductor Package biomaRt.” Nature Protocols 4: 1184–91.
Gu, Zuguang, Roland Eils, and Matthias Schlesner. 2016. “Complex Heatmaps Reveal Patterns and Correlations in Multidimensional Genomic Data.” Bioinformatics. https://doi.org/10.1093/bioinformatics/btw313.
Gu, Zuguang, Lei Gu, Roland Eils, Matthias Schlesner, and Benedikt Brors. 2014. “Circlize Implements and Enhances Circular Visualization in r.” Bioinformatics 30: 2811–12.
Lieberman, Nicole AP, Vikas Peddu, Hong Xie, Lasata Shrestha, Meei-Li Huang, Megan C Mears, Maria N Cajimat, et al. 2020. “In Vivo Antiviral Host Transcriptional Response to SARS-CoV-2 by Viral Load, Sex, and Age.” PLoS Biology 18 (9): e3000849.
Morgan, Martin. 2022. BiocManager: Access the Bioconductor Project Package Repository. https://CRAN.R-project.org/package=BiocManager.
Poma, Anna Maria, Alessandro Basolo, Daniele Bonuccelli, Agnese Proietti, Elena Macerola, Clara Ugolini, Ludovica Torregrossa, et al. 2021. “Activation of Type i and Type II Interferon Signaling in SARS-CoV-2-Positive Thyroid Tissue of Patients Dying from COVID-19.” Thyroid : Official Journal of the American Thyroid Association 31 (12): 1766–75. https://doi.org/10.1089/thy.2021.0345.
Raudvere, Uku, Liis Kolberg, Ivan Kuzmin, Tambet Arak, Priit Adler, Hedi Peterson, and Jaak Vilo. 2019. “G:profiler: A Web Server for Functional Enrichment Analysis and Conversions of Gene Lists.” Nucleic Acids Research. https://doi.org/10.1093/nar/gkz369.
Robinson, Mark D, Davis J McCarthy, and Gordon K Smyth. 2010. “edgeR: A Bioconductor Package for Differential Expression Analysis of Digital Gene Expression Data.” Bioinformatics 26 (1): 139–40. https://doi.org/10.1093/bioinformatics/btp616.
Slowikowski, Kamil. 2023. Ggrepel: Automatically Position Non-Overlapping Text Labels with ’Ggplot2’. https://CRAN.R-project.org/package=ggrepel.
Wickham, Hadley. 2016. Ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. https://ggplot2.tidyverse.org.
Xie, Yihui. 2023. Knitr: A General-Purpose Package for Dynamic Report Generation in r. https://yihui.org/knitr/.
Zhu, Yuelin, Sean Davis, Robert Stephens, Paul S. Meltzer, and Yidong Chen. 2008. “GEOmetadb: Powerful Alternative Search Engine for the Gene Expression Omnibus.” Bioinformatics (Oxford, England) 24 (23): 2798–2800. https://doi.org/10.1093/bioinformatics/btn520.